Orthographic Case Restoration Using Supervised Learning Without Manual Annotation
نویسندگان
چکیده
One challenge in text processing is the treatment of case insensitive documents such as speech recognition results. The traditional approach is to re-train a language model excluding case-related features. This paper presents an alternative two-step approach whereby a preprocessing module (Step 1) is designed to restore case-sensitive form to feed the core system (Step 2). Step 1 is implemented as a Hidden Markov Model trained on a large raw corpus of case sensitive documents. It is demonstrated that this approach (i) outperforms the feature exclusion approach for Named Entity tagging, (ii) leads to limited degradation for semantic parsing and relationship extraction, (iii) reduces system complexity, and (iv) has wide applicability: the restored text can feed both statistical model and rule-based systems.
منابع مشابه
Self-Learning to Detect and Segment Cysts in Lung CT Images without Manual Annotation
Image segmentation is a fundamental problem in medical image analysis. In recent years, deep neural networks achieve impressive performances on many medical image segmentation tasks by supervised learning on large manually annotated data. However, expert annotations on big medical datasets are tedious, expensive or sometimes unavailable. Weakly supervised learning could reduce the effort for an...
متن کاملSuperpixel clustering with deep features for unsupervised road segmentation
Vision-based autonomous driving requires classifying each pixel as corresponding to road or not, which can be addressed using semantic segmentation. Semantic segmentation works well when used with a fully supervised model, but in practice, the required work of creating pixel-wise annotations is very expensive. Although weakly supervised segmentation addresses this issue, most methods are not de...
متن کاملImplicitly-Supervised Learning in Spoken Language Interfaces: An Application to the Confidence Annotation Problem
In this paper we propose the use of a novel learning paradigm in spoken language interfaces – implicitly-supervised learning. The central idea is to extract a supervision signal online, directly from the user, from certain patterns that occur naturally in the conversation. The approach eliminates the need for developer supervision and facilitates online learning and adaptation. As a first step ...
متن کاملTransfer Learning by Ranking for Weakly Supervised Object Annotation
Object detectors [5] locate objects of interest in images and have many applications including image tagging, consumer photography, and surveillance. Most existing object detectors take a fully supervised learning (FSL) approach, where all the training images are manually annotated with the object location. However, manual annotation of hundreds of object categories is time-consuming, laborious...
متن کاملA Multi-dimensional Annotation Scheme for Opinion Mining from Unstructured Data
We present a comprehensive annotation scheme for opinion mining of product review data. We motivate our multi-dimensional annotation scheme both from a linguistic perspective and a task based perspective, and contrast it with other existing annotation schemes for opinion mining. A coding manual and annotated corpus of product review data, both of which were prepared using a stringent methodolog...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2003